Representing Semantic Relationships in Ancient IE Languages

A Pilot Study

Anton Vinogradov, Gabriel Wallace, and Andrew Byrd

University of Kentucky

2024-10-25

Overview of Talk

  • What led us to do this?
  • Overview of Word Embeddings
  • Reconstructing Word Embeddings using Descendant Languages
  • Future Directions & DERBi PIE

Anton sends his regrets

Dr. Anton Vinogradov, (recent!) PhD in Computer Science

What led us to do this?

DERBi PIE

Database of Etymological Roots Beginning in PIE

DERBi PIE

  • Etymological database, with multiple, linked references
    • all of LIV parsed (thanks Thomas Olander!)
    • half of Pokorny fully parsed (will finish next July, provided funding)
    • will finish parsing NIL this December
    • ultimately: everything (?!?)
  • We have applied for NEH funding; we hope this will put us closer to the goal of a public release a year from now.

DERBi PIE: Query Searches

  • Search Functions Created (or ones we know how to create)
    • Integrated Texts - identify roots, stems, and words in texts
    • Phonological Search - identify roots, stems, and words by phonological shape (regex)
    • Morphological Search - identify roots, stems, and words by morphological property (POS, class, gender, etc.)
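
As an illustration, a regex-based phonological search can be sketched in a few lines of Python (the root list and function below are hypothetical stand-ins for illustration, not DERBi PIE's actual implementation):

```python
import re

# Hypothetical sample of PIE roots; DERBi PIE's real data model is richer.
ROOTS = ["*speḱ-", "*peḱ-", "*sed-", "*bʰer-", "*steh₂-"]

def phonological_search(pattern, roots=ROOTS):
    """Return all roots whose shape matches the given regular expression."""
    return [root for root in roots if re.search(pattern, root)]

# Example: roots beginning with *s plus a stop (approximated here as p/t/k).
print(phonological_search(r"^\*s[ptk]"))  # ['*speḱ-', '*steh₂-']
```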

DERBi PIE: Query Searches


  • We quickly realized that identifying semantic categories and relationships (especially ones that make sense across languages) is not an easy task
    • Given cross-linguistic variation, there is no unified classification of “words”

DERBi PIE: What could this do for us?

  • In DERBi PIE, having an integrated semantic system could provide automated answers to questions such as:

    • Are certain sound sequences associated with certain meanings or semantic spheres?

      • Sound symbolism, etc.
    • Are certain morphological classes or derivations similarly associated with particular meanings?

    • How have meanings changed over time into the various branches and daughter languages?

DERBi PIE: the Problem for Today




  • How to create a system of semantic properties and relationships that translates across IE languages?

Overview of Word Embeddings

Word Embeddings: Overview

  • Distributional Analysis: “You shall know a word by the company it keeps.” - John Rupert Firth

  • For example, you are very likely to see the words “dog” and “leash” appear near each other in a text.

  • However, you are less likely to see the words “dog” and “physics” near each other.

    • Why is that?

Word Embeddings: Overview

  • There exist programming tools that allow us to generate vectors (essentially coordinates) based on a word’s position within a text and proximity to other words.

  • All of the vectors that can be generated from a text would lie in a semantic space or hyperspace.

Word Embeddings: Overview

  • This approach is in line with what is done in present-day NLP: identifying semantic relationships through word embeddings.

  • Before we can run a text through one of these tools, we must first discuss tokenization and lemmatization.

Word Embeddings: Processing the Text for Tokens

  • Tokenization is the process of breaking a text into smaller chunks (tokens), typically words.

Word Embeddings: Processing the Text for Lemmas

  • Lemmatization is the process of reducing multiple inflected forms to a single dictionary form (e.g., Latin est and sunt both reduce to esse).

Word Embeddings: Identifying the Context

  • A word embeddings tool can then analyze which words commonly appear near a given word.

Word Embeddings: Constructing a Hyperspace

  • You can tinker with how exactly a word embeddings tool generates these vectors, for example:

    • How many dimensions will it generate for each vector?

    • How far to each side of a word will it look?

    • How many times will it run through the text?
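
For instance, with the gensim library (one common word-embedding tool; the toy Latin corpus below is ours, purely for illustration), those three settings correspond to the vector_size, window, and epochs parameters:

```python
from gensim.models import Word2Vec

# Toy corpus: a list of tokenized (and ideally lemmatized) sentences.
sentences = [
    ["canis", "in", "via", "currit"],
    ["canis", "dominum", "amat"],
    ["physica", "scientia", "naturae", "est"],
]

model = Word2Vec(
    sentences,
    vector_size=100,  # how many dimensions each vector gets
    window=5,         # how far to each side of a word the model looks
    epochs=10,        # how many times it runs through the text
    min_count=1,      # keep even rare words in this tiny example
)
```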

Word Embeddings: Example of a Hyperspace

Plots like these can be created with generated vectors!
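
A minimal sketch of how such a plot could be produced, assuming the toy gensim model from the previous sketch (the talk's actual plots may have been made differently):

```python
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA

# Project the high-dimensional word vectors down to 2D for plotting.
words = list(model.wv.index_to_key)
coords = PCA(n_components=2).fit_transform(model.wv[words])

plt.scatter(coords[:, 0], coords[:, 1])
for word, (x, y) in zip(words, coords):
    plt.annotate(word, (x, y))
plt.show()
```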

Word Embeddings: Example of a Vector

  • Another Latin example:
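
The vector itself is not reproduced here; as a stand-in, this is how one would inspect a single word's vector in the toy gensim model from earlier:

```python
# Each word is stored as a dense array of floats, one per dimension.
vector = model.wv["canis"]
print(vector.shape)  # (100,)
print(vector[:5])    # the first five coordinates
```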

Word Embeddings: Numerical Value of Semantic Similarity

  • To get a numerical representation of the similarity between two vectors, we use their cosine similarity (cosine of the angle between the vectors).
  • Similarity scores range from -1 to 1, with -1 indicating opposite vectors and 1 indicating proportional vectors.

The fancy formula from Wikipedia: cos(θ) = (A · B) / (‖A‖ ‖B‖), i.e., the dot product of the two vectors divided by the product of their magnitudes.
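
A minimal numpy version of that computation:

```python
import numpy as np

def cosine_similarity(a, b):
    """Cosine of the angle between vectors a and b; ranges from -1 to 1."""
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

print(cosine_similarity(np.array([1.0, 0.0]), np.array([1.0, 1.0])))  # ~0.707
```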

Word Embeddings: Example of Cosine Similarity

  • This score suggests that these two words are more similar than not.

Word Embeddings: Reconstructing a Hyperspace in Languages without a Corpus

  • We can construct a hyperspace for ancient languages;
  • But how does one do this for languages without any known corpus, such as PIE?
  • Well, how do we identify other properties of PIE?

Word Embeddings: Basic Idea


  • We take two similar properties in two (or more) related languages, which allows us to approximate an earlier state in the source language

Word Embeddings: Basic Idea

  • In the same way, we propose using word embedding models built from descendant languages to approximate an earlier state in the source language

Reconstructing Word Embeddings using Descendant Languages

Word Embeddings: Methods

  • As you can imagine, this stuff is complicated, which is why we won’t go into much detail about the specific methods;
    • see GitHub for the four-page paper, code, data, etc.

Word Embeddings, Problem #1: Verification


  • So we take descendant hyperspaces and reconstruct an earlier, source hyperspace based on them
  • But how can we trust this methodology?
    • Obviously can’t verify the hyperspace of PIE through analysis of PIE texts!

Word Embeddings, Problem #1: Verification

  • Our answer: test the method on a case we can check, i.e., reconstruct *Latin from its Romance descendants and compare it against models trained on actual Latin

Word Embeddings, Problem #2: Vectors Across Models

  • Vectors generated for hyperspaces take on arbitrary values when training models
    • So ‘dog’ could be (0, 0) in one run and (-6, 100) in the next; these values change every time you train the model
  • For this reason, we must align models (Dev et al., 2021):
    • Identify substructures across language models that remain fixed
    • Use pre-aligned word embedding models (following Joulin et al., 2018)
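
One standard way to align two trained spaces (a sketch of the general idea only; Dev et al. 2021 and Joulin et al. 2018 each use their own, more refined methods) is to learn an orthogonal rotation mapping one space onto the other, e.g. with orthogonal Procrustes:

```python
import numpy as np
from scipy.linalg import orthogonal_procrustes

# X and Y: vectors for the same anchor words in two independently trained models
# (random stand-ins here; real anchors would come from a bilingual word list).
X = np.random.rand(50, 100)
Y = np.random.rand(50, 100)

# Find the rotation R minimizing ||X @ R - Y||, then rotate model A's space.
R, _ = orthogonal_procrustes(X, Y)
X_aligned = X @ R
```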

Word Embeddings: Methods

  • Models trained on French & Spanish Wikipedia articles
  • Generate aligned word vectors using fastText (Bojanowski et al., 2017)
  • Filter out certain words:
    • If there is no corresponding word in the other languages
    • Non-vocabulary, including words with non-language characters
  • Remaining words lemmatized, further reducing vocabulary by roughly 10%
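
Loading such pre-aligned vectors might look like this (a sketch; the filenames are hypothetical, and the aligned .vec files come from the fastText project):

```python
import numpy as np
from gensim.models import KeyedVectors

# Pre-aligned fastText vectors in word2vec text format (filenames hypothetical).
fr = KeyedVectors.load_word2vec_format("wiki.fr.align.vec")
es = KeyedVectors.load_word2vec_format("wiki.es.align.vec")

# Because the two spaces are aligned, cross-lingual comparison is meaningful.
a, b = fr["chien"], es["perro"]
print(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))
```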

Word Embeddings: Methods


  • To relate words together and find common words:
    • Words are translated into each other’s respective languages using Google Translate, via the Python translation library deep-translator
    • The same is done with Latin, using the Latin corpus (from CLTK [the Tesserae Project]), which is lemmatized
  • Any word that cannot be lemmatized in Latin (such as Greek words) is removed from the corpus
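
The translation step, sketched with deep-translator (the words and language codes are illustrative; Google Translate's imperfect handling of Latin is part of the problem discussed later):

```python
from deep_translator import GoogleTranslator

# Translate each French word into Spanish and Latin to find correspondences.
to_spanish = GoogleTranslator(source="fr", target="es")
to_latin = GoogleTranslator(source="fr", target="la")

print(to_spanish.translate("chien"))  # e.g., 'perro'
print(to_latin.translate("chien"))    # e.g., 'canis'
```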

Word Embeddings: Calculating *Latin

  1. Identify each language-word center (lwc):
     • If there’s a 1:1 correspondence between the Romance word & Latin, the lwc is that word’s vector: French être : Latin esse
     • If there isn’t a 1:1 correspondence, the lwc is the centroid of the lemma’s vectors: French frêle, fragile : Latin fragilis

  2. Identify the centroid of both lwcs -> the inter-language-word center (ilwc)
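
Steps 1-2 in numpy, assuming the aligned fr and es vectors loaded earlier (the specific words are illustrative):

```python
import numpy as np

# French has two words for Latin 'fragilis', so its lwc is their centroid.
fr_lwc = np.mean([fr["frêle"], fr["fragile"]], axis=0)
# Spanish has a 1:1 match, so its lwc is simply that word's vector.
es_lwc = es["frágil"]

# Step 2: the inter-language-word center is the centroid of the lwcs.
ilwc = np.mean([fr_lwc, es_lwc], axis=0)
```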

Word Embeddings: Calculating *Latin

  • With a better model and better translations, this is all we’d need.
  • Two extra steps were added by Anton to account for translation errors and unexpected outliers:
  3. Identify the two vectors closest to the ilwc using cosine distance.

  4. Take the average of these two vectors to arrive at the approximate *Latin word vector.
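
Steps 3-4 sketched in numpy; this is our reading of the procedure, and the candidate vocabulary here is a random stand-in (see the paper and code on GitHub for the exact implementation):

```python
import numpy as np

def cos(a, b):
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

# Stand-ins for the ilwc and a candidate vocabulary of aligned vectors.
rng = np.random.default_rng(0)
ilwc = rng.random(100)
vocab = {w: rng.random(100) for w in ["fragilis", "debilis", "infirmus"]}

# Step 3: rank candidates by cosine similarity to the ilwc; keep the two closest.
ranked = sorted(vocab, key=lambda w: cos(vocab[w], ilwc), reverse=True)
closest = [vocab[w] for w in ranked[:2]]

# Step 4: the approximate *Latin word vector is the mean of those two vectors.
proto_latin = np.mean(closest, axis=0)
```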

Word Embeddings: the “Normal” Model

  • To assess the effectiveness of the method, recall that we want to compare the reconstructed *Latin with actual Latin
  • But how do we know how well each model performs?

Word Embeddings: How to Assess the Goodness of Your Model

  • We follow Stringham & Izbicki (2020) in using the OddOneOut task, which has been demonstrated to be more accurate for languages with small corpora (low-resource languages, LRLs), especially ones that aren’t modern
  • The OddOneOut task has been demonstrated to work for languages with corpora as small as 1,800 tokens (Old Gujarati)

Word Embeddings: OddOneOut (Stringham & Izbicki 2020)

  • The OddOneOut technique “measures a model’s ability to identify words that are unrelated to [a specific] category”

One of these things is not like the other!

Word Embeddings: OddOneOut (Stringham & Izbicki 2020)

  • Categories are automatically drawn from Wikidata and implemented when applicable.
  • Example categories include (ibid., 181):
    • Human Biblical Figures: Abednego, Abraham, Azor, Chedorlaomer, Christ, Goliath, Hilkiah, Lo-Ammi, Matthew the Apostle, Peter, Sheba, Uz, Yael, Zerubbabel
    • Hinduism: Adharma, Ashvadamedha, Brahmin, Bhagavan, Guru, Hindu Prayer Beads, Karma, Mahātmā, Mudra, Nirvana
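
In spirit, each trial gives the model a category plus one intruder and asks it to spot the odd one out; gensim's doesnt_match does something directly analogous (a sketch, not Stringham & Izbicki's actual code):

```python
# Assumes a gensim model trained on a real Latin corpus
# (the toy model from earlier is too small for this).
category = ["canis", "equus", "felis"]  # hypothetical in-category words
intruder = "physica"

# The model should flag the word least similar to the rest of the set.
print(model.wv.doesnt_match(category + [intruder]))  # ideally 'physica'
```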

Word Embeddings: Results

  • Our results indicate mixed success: we can create a hyperspace from the descendant languages that performs reasonably well on Latin tests with OddOneOut
  • Preliminary tests show that the smaller the corpus, the better the reconstructed model performs as compared to the normal one

Word Embeddings: Results

  • This is not exactly a bad thing, as small corpora are the norm for many languages in our discipline!
    • If a language has a large corpus (such as Ancient Greek), we can reduce it as we’ve done here for Latin
    • To train such models, tokens are reduced randomly, with respect to both the number of tokens kept and the specific tokens used
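
One way such a random reduction could be implemented (a sketch under our own assumptions, not necessarily the paper's exact procedure):

```python
import random

def reduce_corpus(sentences, target_tokens):
    """Randomly keep sentences until roughly target_tokens tokens remain."""
    kept, count = [], 0
    for sent in random.sample(sentences, len(sentences)):
        if count >= target_tokens:
            break
        kept.append(sent)
        count += len(sent)
    return kept
```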

Word Embeddings: Results


  • It is unclear what the “magic number” is for corpus size to arrive at the most accurate representation (compared to the normal model)
  • It is also unclear exactly why the performance of descendant models tends to decrease as corpus size increases
    • Anton has ideas, all model- or programming-specific; see the draft for further discussion

Future Directions & DERBi PIE

Recap: Descendant Model


  • In this talk, we implement the Comparative Method to reconstruct the hyperspaces of languages without corpora
  • Upsides:
    1. fully automated
    2. requires little CPU processing
  • Downsides:
    1. doesn’t distinguish multiple senses (polysemy)
    2. only performs well on smaller corpora (compared to normal model)

Word Embeddings: Problems with Current Model (Anonymous NLP reviewers)

  1. Use of Google Translate is suboptimal and may result in translation errors; we should use either bilingual dictionaries or LLMs (think GPT) for more accurate translations
  2. LLMs > Word Embeddings
    1. Vectors: polysemy (e.g., ‘bank’)
    2. Vectors: context
    3. Vectors: precision (300 dimensions vs. 175B parameters)

Future Directions: Using LLMs

  • We believe that our reconstruction model shows promise

  • We plan to stick with the Descendant Model strategy, but:

    • Utilize LLMs (such as GPT) to model the hyperspace, for greater precision and differentiation of polysemy

    • Instead of Google Translate, use a bilingual dictionary or an LLM

      • We’ve had great success with LLMs in the parsing of data for DERBi PIE

Future Directions: Using LLMs

  • Add additional Romance languages;

  • When happy with results, move on to other subbranches (likely Slavic or Indic) for testing;

  • We believe that a good, workable model should be able to generate a *proto-vector through the centroid of the descendant language-word centers (minus those additional steps).

Future Directions: Integration into DERBi PIE

  • Let’s finish with where we began – DERBi PIE

  • With aligned hyperspaces generated for each descendant and reconstructed language, our models would give us coordinates for each lexeme in a language

Future Directions: Integration into DERBi PIE

  • Possible question: “How semantically similar are PIE *sC- roots compared to related *C- roots?” (*(s)peḱ- ‘look at’)

    • Calculate cosine similarities!
  • Example from English:

    • bl- words: “blue”, “blaze”, “bland”, “blush”, “blink”, “blow”, “blast”, “blot”, “blend”, “bleak”

    • These words have a cosine similarity score of 0.8016 (remember: a perfect score is 1.0!)
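
A sketch of how such a group score could be computed; we assume it is the mean pairwise cosine similarity over pretrained English vectors (filename hypothetical), though the talk's exact computation may differ:

```python
from itertools import combinations
import numpy as np
from gensim.models import KeyedVectors

en = KeyedVectors.load_word2vec_format("wiki.en.align.vec")  # hypothetical file

bl_words = ["blue", "blaze", "bland", "blush", "blink",
            "blow", "blast", "blot", "blend", "bleak"]

def cos(a, b):
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

pairs = [cos(en[a], en[b]) for a, b in combinations(bl_words, 2)]
print(np.mean(pairs))  # the talk reports 0.8016 for this set
```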

Thank you!

Download: Slides, Paper, Code